5000 Fastest Growing Private Companies in the U.S.
## [1] "row_num" "id" "rank" "workers" "company"
## [6] "url" "state_l" "state_s" "city" "metro"
## [11] "growth" "revenue" "industry" "yrs_on_list"
5000 companies observed over 14 variables listed above
## [1] 5000 14
## 'data.frame': 5000 obs. of 14 variables:
## $ row_num : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ workers : int 227 191 145 62 92 50 129 130 264 11 ...
## $ company : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
## $ url : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
## $ state_l : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
## $ state_s : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
## $ city : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
## $ metro : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
## $ growth : num 158957 57348 55460 26043 20690 ...
## $ revenue : num 195640000 82640563 85076502 35293000 77652360 ...
## $ industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
## $ yrs_on_list: int 2 1 1 1 1 2 2 1 1 1 ...
Explore factor variables and the different levels in State and Industry
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Puerto Rico" "Rhode Island" "South Carolina"
## [43] "South Dakota" "Tennessee" "Texas"
## [46] "Utah" "Vermont" "Virginia"
## [49] "Washington" "West Virginia" "Wisconsin"
## row_num id rank workers
## Min. : 0 Min. : 4 5000 : 1 Min. : 0
## 1st Qu.:1250 1st Qu.:19575 4999 : 1 1st Qu.: 24
## Median :2500 Median :23292 4998 : 1 Median : 50
## Mean :2500 Mean :20037 4997 : 1 Mean : 209
## 3rd Qu.:3749 3rd Qu.:25370 4996 : 1 3rd Qu.: 125
## Max. :4999 Max. :26620 4995 : 1 Max. :34219
## (Other):4994
## company url state_l
## (add)ventures : 1 @properties : 1 California: 694
## @Properties : 1 110-consulting: 1 Texas : 404
## 110 Consulting: 1 123stores : 1 New York : 335
## 123Stores : 1 180 : 1 Florida : 303
## 180 : 1 180fusion : 1 Virginia : 284
## 180Fusion : 1 1seocom : 1 Illinois : 238
## (Other) :4994 (Other) :4994 (Other) :2742
## state_s city metro growth
## CA : 694 New York : 178 New York City: 399 Min. : 42.45
## TX : 404 Chicago : 95 Washington DC: 316 1st Qu.: 84.21
## NY : 335 Atlanta : 94 Los Angeles : 274 Median : 151.72
## FL : 303 Austin : 87 Chicago : 224 Mean : 516.44
## VA : 284 San Diego: 80 Atlanta : 194 3rd Qu.: 347.65
## IL : 238 Houston : 76 Dallas : 169 Max. :158956.91
## (Other):2742 (Other) :4390 (Other) :3424
## revenue industry yrs_on_list
## Min. : 1953000 IT Services : 733 Min. : 1.000
## 1st Qu.: 4876791 Advertising & Marketing : 453 1st Qu.: 1.000
## Median : 10722077 Business Products & Services: 435 Median : 2.000
## Mean : 43058182 Health : 377 Mean : 2.744
## 3rd Qu.: 26952131 Software : 338 3rd Qu.: 4.000
## Max. :5528202691 Financial Services : 278 Max. :12.000
## (Other) :2386
The first plot’s bins are too wide to show the trend. Reducing the bin width reveals a histogram that skews right. What happens to the distribution if I apply a log10 transformation?
Taking the log10 of workers compresses the long tail and makes the distribution easier to read. The transformed distribution looks close to normal, with a longer tail on the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 24 50 209 125 34220
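The effect of the transformation can be sketched on a handful of values. This is a minimal illustration, not the code used for the plots: the vector below is hypothetical apart from its minimum (0) and maximum (34219), which match the summary. Dropping zero counts before taking the log is an assumption made here to avoid `-Inf`.

```r
# Hypothetical worker counts with a long right tail; only the
# minimum (0) and maximum (34219) are taken from the summary output.
workers <- c(0, 11, 24, 50, 62, 125, 264, 34219)

# log10(0) is -Inf, so keep only positive counts before transforming
# (an assumption for this sketch; adding 1 before the log is another option).
positive_workers <- workers[workers > 0]
log_workers <- log10(positive_workers)

# After the transform the mean sits much closer to the median,
# a quick sign that the right skew has been reduced.
raw_gap <- mean(positive_workers) - median(positive_workers)
log_gap <- mean(log_workers) - median(log_workers)
print(c(raw_gap = raw_gap, log_gap = log_gap))
```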
The top industry is IT Services with 733 companies, the most represented industry by a large margin. The next two industries, Advertising & Marketing (453) and Business Products & Services (435), each have just over 400 companies, roughly 60% of the IT Services count.
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
The revenue distribution is strongly skewed right with a very long tail. A log10 transformation with an adjusted bin width gives a more natural view of the revenue data and its trends. However, even after a log10 transformation, the data are still skewed to the right. Removing extremely high revenue outliers yields a more normal-looking distribution.
Part of the reason the distribution doesn’t look entirely normal is that the log-normal distribution appears truncated on the left side. This is likely because the dataset contains only the top 5000 companies. If the data extended to 10,000 companies, for example, the curve would likely look more normally distributed.
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.45 84.21 151.70 516.40 347.70 159000.00
The long right tail justifies a log10 transformation.
The distributions of growth and revenue look very similar. Let’s try another type of plot to tease apart how they differ. The frequency polygon plot better shows the different shapes of the distributions. Growth is derived from revenue, so it is not surprising that the two distributions are similar: they are highly correlated.
## Loading required package: grid
Many of the highest ranked companies are small businesses. This could be because smaller companies grow faster than big public companies. But it could also be that smaller companies start from smaller amounts of revenue. Absolute growth in dollars is different from percentage growth. For example, a company with no revenue the previous year that gains some revenue the next year has infinite percentage growth. But this isn’t a good reflection of how much revenue the company is generating compared to another company that makes more in absolute revenue but has lower percentage growth.
I created two new variables: revenue 2013, derived from current (2014) revenue and percentage growth, and growth in dollars, which is revenue 2014 minus revenue 2013.
## [1] 123000 143853 153125 135000 373500 690697
## [1] 195517000 82496710 84923377 35158000 77278860 137286506
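The derivation can be checked by hand on the first two companies. This is a minimal sketch assuming percentage growth is defined relative to 2013 revenue, which is consistent with the values printed above. The variable names are illustrative, and the second growth figure is the rounded value shown by `str()`, so its result is only approximate.

```r
# First two companies; revenue and growth values come from the output above.
# 158956.91 is the maximum growth from summary(); 57348 is rounded by str().
revenue2014 <- c(195640000, 82640563)
growth_pct  <- c(158956.91, 57348)

# revenue2014 = revenue2013 * (1 + growth / 100), solved for revenue2013
revenue2013 <- revenue2014 / (1 + growth_pct / 100)

# Absolute growth in dollars
growth_dollar <- revenue2014 - revenue2013

print(round(revenue2013))
print(round(growth_dollar))
```

The first company illustrates the point about percentage versus absolute growth: a very large percentage corresponds to a 2013 base of roughly $123,000.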
There is a limitation in my data set. Without resident population data for each state, city, or metro area, it is hard to determine whether the states with the most growing companies have them because more people live there or because something about the state fosters growth. Therefore, I looked for population data from the U.S. Census Bureau and found population estimates for 2010 to 2014. These pair with the 2014 company data and the reverse-engineered 2013 revenue and growth numbers I calculated.
The structure of the new dataset of state population data:
## Geographic_Area Census_April1 Estimate_Base Est_2010 Est_2011
## 1 United States 308,745,538 308,758,105 309,347,057 311,721,632
## 2 Northeast 55,317,240 55,318,348 55,381,690 55,635,670
## 3 Midwest 66,927,001 66,929,898 66,972,390 67,149,657
## 4 South 114,555,744 114,562,951 114,871,231 116,089,908
## 5 West 71,945,553 71,946,908 72,121,746 72,846,397
## 6 Alabama 4,779,736 4,780,127 4,785,822 4,801,695
## Est_2012 Est_2013 Est_2014
## 1 314,112,078 316,497,531 318,857,056
## 2 55,832,038 56,028,220 56,152,333
## 3 67,331,458 67,567,871 67,745,108
## 4 117,346,322 118,522,802 119,771,934
## 5 73,602,260 74,378,638 75,187,681
## 6 4,817,484 4,833,996 4,849,377
Using dplyr, I can create a new dataset that aggregates all growth and revenue numbers for companies by state and calculates the growth per capita.
The variables and structure of the new dataset.
## [1] "state_l" "state_growth_dollar" "state_population2014"
## [4] "growth_per_capita"
## 'data.frame': 51 obs. of 4 variables:
## $ state_l : Factor w/ 51 levels "Alabama","Alaska",..: 5 45 14 33 36 48 10 22 23 44 ...
## $ state_growth_dollar : num 18309472149 17237538957 7298472080 6272947980 6133911042 ...
## $ state_population2014: num 38802500 26956958 12880580 19746227 11594163 ...
## $ growth_per_capita : num 472 639 567 318 529 ...
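The aggregation described above might look like the following with dplyr. This is a sketch on toy data, not the original code: the two small data frames and all of their numbers are hypothetical, and only the column names follow the structure printed above.

```r
library(dplyr)

# Toy stand-ins for the company and population data; all values are hypothetical.
companies <- data.frame(
  state_l = c("California", "California", "Alabama"),
  growth_dollar = c(300, 100, 50),
  stringsAsFactors = FALSE
)
population <- data.frame(
  state_l = c("Alabama", "California"),
  state_population2014 = c(10, 40),
  stringsAsFactors = FALSE
)

state_growth <- companies %>%
  group_by(state_l) %>%
  # Total growth in dollars across all companies in a state
  summarise(state_growth_dollar = sum(growth_dollar)) %>%
  # Attach the 2014 population estimate for each state
  inner_join(population, by = "state_l") %>%
  # Normalize growth by population
  mutate(growth_per_capita = state_growth_dollar / state_population2014) %>%
  arrange(desc(state_growth_dollar))

print(as.data.frame(state_growth))
```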
I have two datasets. The original dataset is a list of the 5000 fastest growing private companies in 2014 in the U.S. from Inc. 5000. The second dataset I have is state population data from the Census Bureau. I have two resulting data frames: companies is the Inc. 5000 data set with new variables added, and state_growth is population data with additional variables.
The variables most interesting to explore are the growth in percentage and dollar amounts since the dataset from Inc. 5000 is specifically about the fastest growing private companies in the U.S. I am also very interested in the industry the companies are in.
Revenue will be an important way to understand growth. For example, a company with small revenue will see greater gains in percentage growth than a company with larger revenue, but the latter could have much greater revenue and growth in absolute dollar amounts. So it is critical to interpret growth in light of revenue.
State population data is also important to better understand growth. A larger state might appear to have greater growth in absolute dollar amounts but that could be influenced by a greater population. Therefore investigating growth per capita can provide a fairer way to look at growth, especially from the point of view of smaller states.
I created four new variables from existing variables across the two datasets. Two are in the companies data frame: 1. revenue2013 and 2. growth_dollar. I reverse engineered 2013 revenue from 2014 revenue and percentage growth, then subtracted 2013 revenue from 2014 revenue to get growth_dollar.
I also created a new data frame using the state population data from the census. In this data frame, I added two other variables: 3. state_growth_dollar and 4. growth_per_capita. state_growth_dollar was calculated by grouping companies by state and summing growth_dollar (the second variable above). growth_per_capita was created by dividing state_growth_dollar by the state population.
The revenue, growth, and workers histograms are all skewed right with a very long tail, so I performed a log transformation to better understand the data. I also did a lot of tidying and adjusting to import and join the two data frames, including converting the population data to numeric, because the commas separating the thousands places caused read.csv() to import the population numbers as characters. I needed the population numbers to be numeric so I could divide by them to calculate growth_per_capita.
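One common way to handle the comma problem is to strip the separators with `gsub()` before converting. The sketch below assumes that approach (the original conversion code is not shown) and uses the Alabama 2014 and 2013 estimates from the Census table as sample input.

```r
# Population strings as they appear in the Census table (Alabama, 2014 and 2013)
est_raw <- c("4,849,377", "4,833,996")

# Remove the thousands separators, then convert to numeric
est_numeric <- as.numeric(gsub(",", "", est_raw, fixed = TRUE))

print(est_numeric)
```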
The structure of the two datasets: 1. State population and aggregate growth of companies 2. 5000 fastest growing companies and the attributes that describe them
## 'data.frame': 5000 obs. of 16 variables:
## $ row_num : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
## $ rank : Ord.factor w/ 5000 levels "5000"<"4999"<..: 5000 4999 4998 4997 4996 4995 4994 4993 4992 4991 ...
## $ workers : int 227 191 145 62 92 50 129 130 264 11 ...
## $ company : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
## $ url : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
## $ state_l : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
## $ state_s : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
## $ city : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
## $ metro : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
## $ growth_percentage: num 158957 57348 55460 26043 20690 ...
## $ revenue2014 : num 195640000 82640563 85076502 35293000 77652360 ...
## $ industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
## $ yrs_on_list : int 2 1 1 1 1 2 2 1 1 1 ...
## $ revenue2013 : num 123000 143853 153125 135000 373500 ...
## $ growth_dollar : num 195517000 82496710 84923377 35158000 77278860 ...
## 'data.frame': 5000 obs. of 7 variables:
## $ workers : int 227 191 145 62 92 50 129 130 264 11 ...
## $ growth_percentage: num 158957 57348 55460 26043 20690 ...
## $ revenue2014 : num 195640000 82640563 85076502 35293000 77652360 ...
## $ industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
## $ yrs_on_list : int 2 1 1 1 1 2 2 1 1 1 ...
## $ revenue2013 : num 123000 143853 153125 135000 373500 ...
## $ growth_dollar : num 195517000 82496710 84923377 35158000 77278860 ...
The state growth in dollars shows most states clustered in the same area under $7.5 Billion. However, two states, California and Texas, have an extremely large amount of growth at $17-18 Billion. But when looking at state growth in dollars per capita, the top two states are Virginia and Colorado, with Texas trailing closely behind Colorado. How does population affect growth? Future plots should explore the relationship between state population and revenue, as well as state population and growth, to uncover other trends.
## $title
## [1] "Revenue growth by state, normalized by population"
##
## attr(,"class")
## [1] "labels"
As concluded earlier, Virginia, Colorado, and Texas have the fastest growth in dollars per capita. California is trailing at #13.
Note: Refer to the frequency polygon and density plots in the Univariate section to see the differences in distribution between revenue in 2014 and percentage growth.
Revenue in 2014 and growth in dollars are strongly correlated, with a Pearson’s r of 0.95.
##
## Pearson's product-moment correlation
##
## data: companies$revenue2014 and companies$growth_dollar
## t = 208.4471, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9440788 0.9498019
## sample estimates:
## cor
## 0.9470155
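As a sanity check on the output above, the t statistic can be recovered from the correlation and the degrees of freedom using the standard formula t = r * sqrt(df / (1 - r^2)). This is a small worked calculation using only the reported values; since r is rounded, the result matches the reported t only approximately.

```r
# Values reported in the cor.test() output above
r  <- 0.9470155
df <- 4998

# t statistic for testing a Pearson correlation against zero
t_stat <- r * sqrt(df / (1 - r^2))

# Compare with the reported t = 208.4471
print(t_stat)
```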
A strong correlation also exists between revenue in 2013 and growth in dollars, with a Pearson’s r of 0.77. However, this relationship is weaker than the one between 2014 revenue and growth, which is expected since the dataset focuses on the fastest growing companies in 2014. Growth in 2014 is clearly tied to revenue in 2013, so a high correlation between the two is also expected.
##
## Pearson's product-moment correlation
##
## data: companies$revenue2013 and companies$growth_dollar
## t = 84.3823, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7548497 0.7777247
## sample estimates:
## cor
## 0.7665302
The plots of growth (in dollars and in percentage) by 2013 revenue, with a log10 transformation, have a very unusual fan-shaped distribution, in contrast with the 2014 revenue plots, which are more cone-shaped. This could be explained by the same effect that made the revenue histogram appear truncated on the left side. The fan shape is likely a result of the fact that companies with 2013 revenue less than $1,000,000 were excluded from the list.
The following plots look at the number of years a company has appeared on the Inc. 5000 list.
## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).
##
## Pearson's product-moment correlation
##
## data: companies$yrs_on_list and companies$revenue2014
## t = 10.7641, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1233182 0.1775016
## sample estimates:
## cor
## 0.150523
## Warning in loop_apply(n, do.ply): Removed 124 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 358 rows containing non-finite
## values (stat_boxplot).
## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 21 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 830 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 816 rows containing missing
## values (stat_summary).
## Warning in loop_apply(n, do.ply): Removed 827 rows containing missing
## values (geom_point).
##
## Pearson's product-moment correlation
##
## data: companies$workers and companies$growth_dollar
## t = 16.0263, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1945545 0.2472854
## sample estimates:
## cor
## 0.2210815
## Warning in loop_apply(n, do.ply): Removed 852 rows containing missing
## values (stat_summary).
## Warning in loop_apply(n, do.ply): Removed 862 rows containing missing
## values (geom_point).
##
## Pearson's product-moment correlation
##
## data: companies$workers and companies$growth_percentage
## t = -0.8873, df = 4998, p-value = 0.3749
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04025575 0.01517411
## sample estimates:
## cor
## -0.01255046
The relationship between workers and growth is quite weak. This makes sense, as the number of workers isn’t necessarily predictive of growth compared to revenue or even the product being produced.
## Warning in loop_apply(n, do.ply): Removed 69 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 469 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 65 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 362 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 9 rows containing missing values
## (geom_point).
##
## Pearson's product-moment correlation
##
## data: state_growth$state_population2014 and state_growth$state_growth_dollar
## t = 18.5743, df = 49, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8895717 0.9630005
## sample estimates:
## cor
## 0.9357539
Generally, the trend appears to be that the higher the population, the greater the total state growth in dollars. This is supported by the high Pearson’s r correlation of 0.94 between state population in 2014 and state growth in dollars.
## Warning in loop_apply(n, do.ply): Removed 1 rows containing missing values
## (geom_point).
##
## Pearson's product-moment correlation
##
## data: state_growth$state_population2014 and state_growth$growth_per_capita
## t = 2.6938, df = 49, p-value = 0.009647
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.09274621 0.57756852
## sample estimates:
## cor
## 0.3591502
On the other hand, state population and growth per capita are not as strongly related, with a Pearson’s r of 0.36.
## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_boxplot).
## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_ydensity).
There is a very strong relationship between 2014 revenue and growth. There is a weaker but still very strong relationship between 2013 revenue and growth.
The other features affect growth far less than revenue does.
There appears to be some positive relationship between state population and growth per capita, with a Pearson’s r correlation of 0.36. However, the scatter plot of the two variables didn’t look promising, so I think analysis of state population and growth is a dead end.
The relationship between workers and growth is very weak. The plots don’t follow a clear positive or negative trend, and the Pearson’s r value of 0.22 suggests only a slightly positive relationship.
Most companies have been on the fastest growing companies list fewer than 3 times. This suggests that there isn’t much repeat of past growth performance, i.e. it is difficult to be among the fastest growing companies more than 7 times, which makes sense because fast growth is difficult to sustain. It is not clear whether the number of years a company has been on the list affects growth. Generally, it looks like there isn’t much of a trend here.
The strongest relationship observed was between 2014 revenue and growth in dollars (r value of 0.95), closely followed by the relationship between 2014 state population and total state growth in dollars (r value of 0.94).
I removed 5 outliers whose revenues were greater than $3 Billion so I could see the general trend in the data. However, I would like to look at only the outliers to determine what trends are observable from their growth and the industries they are in.
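Splitting the data at the $3 Billion cutoff can be sketched as below. The data frame is a toy example: the company labels and the $3.2 Billion value are hypothetical, while the other three revenues are the minimum, median, and maximum from the summary output.

```r
# Toy data; company labels and the 3.2e9 value are hypothetical.
# 1953000, 10722077, and 5528202691 are the min, median, and max revenue
# reported by summary().
companies <- data.frame(
  company = c("A", "B", "C", "D"),
  revenue2014 = c(1953000, 10722077, 3200000000, 5528202691),
  stringsAsFactors = FALSE
)

cutoff <- 3e9  # $3 Billion

# Keep the bulk of the data for trend plots, and set the outliers
# aside so they can be examined on their own.
trimmed  <- subset(companies, revenue2014 <= cutoff)
outliers <- subset(companies, revenue2014 > cutoff)

print(outliers)
```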
Industry has too many values (25 levels). I need to group the existing industries into overarching categories to plot using color. Otherwise, the values will be too difficult to distinguish on a scatterplot. I grouped industries based on the [Global Industry Classification Standard](http://en.wikipedia.org/wiki/Global_Industry_Classification_Standard) developed by MSCI and S&P.
## [1] "Advertising & Marketing" "Business Products & Services"
## [3] "Computer Hardware" "Construction"
## [5] "Consumer Products & Services" "Education"
## [7] "Energy" "Engineering"
## [9] "Environmental Services" "Financial Services"
## [11] "Food & Beverage" "Government Services"
## [13] "Health" "Human Resources"
## [15] "Insurance" "IT Services"
## [17] "Logistics & Transportation" "Manufacturing"
## [19] "Media" "Real Estate"
## [21] "Retail" "Security"
## [23] "Software" "Telecommunications"
## [25] "Travel & Hospitality"
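A grouping like this can be expressed as a named lookup vector. The mapping below is a hypothetical, partial illustration: the actual assignment of each of the 25 industries to a sector is not shown in this report, so the pairs here are assumptions chosen only to demonstrate the technique.

```r
# Hypothetical partial mapping from Inc. 5000 industries to broader sectors.
# The real assignment used in the analysis is not shown, so treat these
# pairs as assumptions.
sector_lookup <- c(
  "IT Services"        = "Information Technology",
  "Software"           = "Information Technology",
  "Computer Hardware"  = "Information Technology",
  "Energy"             = "Energy",
  "Health"             = "Health Care",
  "Financial Services" = "Financials",
  "Insurance"          = "Financials",
  "Telecommunications" = "Telecommunication Services"
)

# Look up the sector for each company's industry
industry <- c("Software", "Energy", "Health", "IT Services")
sector <- factor(unname(sector_lookup[industry]))

print(table(sector))
```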
Since I work in the IT/Software industry, I am curious whether the IT Services category has different traits compared to the other industries. I decided to group all 6 other categories that were not IT into a “Non-IT” category and plot variables to see if any trends emerged.
## [1] "Consumer Sector" "Industrials"
## [3] "Information Technology" "Energy"
## [5] "Financials" "Health Care"
## [7] "Telecommunication Services"
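Collapsing the seven sectors into two groups can be done with `ifelse()`. This sketch assumes the "IT" group corresponds to the Information Technology sector listed above; the variable names are illustrative.

```r
# The seven sector names from the output above
sector <- c("Consumer Sector", "Industrials", "Information Technology",
            "Energy", "Financials", "Health Care",
            "Telecommunication Services")

# Assumption: "IT" means the Information Technology sector; everything
# else is grouped as "Non-IT".
it_group <- factor(
  ifelse(sector == "Information Technology", "IT", "Non-IT"),
  levels = c("IT", "Non-IT")
)

print(table(it_group))
```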
Generally, growth for Non-IT companies exceeds growth for IT companies except when revenue is around $250 million. There is a positive relationship between revenue and growth.
At the lower levels of revenue ($15 Million or less), there is no noticeable difference in trends between the IT Services industry and all other industries. However, across almost the complete dataset (revenue of $2 Billion or less, which excludes half a dozen outliers with over $2 Billion in revenue), the IT Services category lags behind non-IT categories in revenue and growth. The plots suggest that growth in the IT Services category follows a logarithmic curve that exceeds non-IT industries at around $250 Million in revenue, but the non-IT categories overtake IT Services after the $500 Million mark.
It makes sense that the top 6 highest revenue and growth companies excluded from the plot are not IT companies. Rather, they are in the industry categories of Energy, Health, Construction, and Real Estate. These industries typically require bigger capital investments and have higher barriers to entry because they offer expensive products and services. Logically, I conclude that they are likelier to produce higher revenue, and the U.S. spends a lot of money in the energy, health care, real estate, and construction sectors. Also note that revenue doesn’t equal profit. These industries could have high revenues with low profit margins precisely because of the infrastructure and capital expenses required to operate the company.
Let’s take a look at the relationships among workers, revenue, growth, and other factors that might shed light on the growth of companies.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 24 50 209 125 34220
##
## Pearson's product-moment correlation
##
## data: companies$workers and companies$growth_dollar
## t = 16.0263, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1945545 0.2472854
## sample estimates:
## cor
## 0.2210815
Companies that have a lower number of employees also tend to have a lower amount of revenue and therefore lower growth in dollars (since growth is derived from revenue). There is a consistent trend of the companies with the greatest number of employees also having the highest revenue and growth. The second plot, with revenue and percentage growth colored by workers, shows that companies with fewer employees also have lower percentage growth. This conclusion dispels the theory I proposed earlier that smaller companies with fewer employees and lower revenue in dollars have greater percentage growth because every incremental revenue dollar accounts for a greater degree of growth.
Rather, the trend seems to be that companies with the greatest number of employees also tend to generate the most revenue and growth, calculated in both dollars and percentage. This conclusion is further supported by the scatter plot that shows the relationship between revenue and growth per worker. There is a quite clear grouping of revenue and growth per worker based on the workers groups.
The top states are all very similar in their distribution of growth by revenue. Some outliers differ but the boxplots are generally almost the same suggesting that there is not much of a difference across the different quartiles.
There isn’t a clear trend when it comes to states and industries. Among the top states, none appears to have a particular industry overrepresented. The growth by revenue trends also seem consistent across all states: strong positive relationships. No states have increasing revenues but downward growth. State probably doesn’t play a big role in determining growth for a company. This is surprising, as I assumed states like California and perhaps Texas or New York would be favorable towards the tech industries.
This plot shows the effect of industry on revenue and growth, colored by number of employees. The top 3 industries have the densest points because their sample sizes are greater. But there is no consistent pattern across industries when it comes to workers, i.e. no industry has particularly small or very large numbers of employees. All industries have a range of employee numbers. However, employee numbers do support the previous finding that the categories with smaller numbers of employees tend to have lower revenue and lower growth.
This plot also supports the previous conclusion that the Healthcare, Energy, and to some degree Industrials and Financials industries exceed the other industries in the growth and revenue they achieved. There are companies in these industries in the upper-right corner, which the IT, Consumer Sector, and Telecommunication Services industries don’t reach.
Previous analysis showed there is a strong relationship between revenue and growth. Multivariate plots also showed that the industry matters: Healthcare, Energy, and Industrials achieve the highest degree of revenue and growth compared to IT, Consumer Sector, and Telecommunication Services industries. The Financials industry can be grouped with the higher revenue and growth group but it was represented as one of the outliers in the revenue versus growth plots.
The greater the number of employees at a company, the more likely the company is to have greater revenue and growth.
There is a positive relationship between revenue and growth, which dispels my earlier hypothesis that smaller companies, despite lower growth in dollars, would show higher percentage growth. The companies with the greatest revenue also had the highest growth measured in both dollars and percentage.
The multivariate plots suggest that the top states have different industries represented. I thought California would stand out with more Information Technology companies and Texas would have more Energy companies. But these stereotypes don’t seem to hold: the top states generally have a diverse range of industries, which suggests that they have many fast growing companies in different sectors rather than companies specializing in one or two specific industries.
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Attaching package: 'memisc'
##
## The following object is masked from 'package:plyr':
##
## rename
##
## The following object is masked from 'package:scales':
##
## percent
##
## The following objects are masked from 'package:dplyr':
##
## collect, query, rename
##
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
##
## The following object is masked from 'package:base':
##
## as.array
##
## Calls:
## m1: lm(formula = I(growth_dollar) ~ I(revenue2014), data = companies)
## m2: lm(formula = I(growth_dollar) ~ I(revenue2014) + workers, data = companies)
## m3: lm(formula = I(growth_dollar) ~ I(revenue2014) + workers + industry,
## data = companies)
## m4: lm(formula = I(growth_dollar) ~ I(revenue2014) + workers + industry +
## state_l, data = companies)
##
## ===============================================================================================================================================================
## m1 m2 m3 m4
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1374603.674** 1817698.230*** 2007177.219 1065470.861
## (478521.520) (480417.066) (1538968.379) (4733320.471)
## I(revenue2014) 0.534*** 0.539*** 0.538*** 0.538***
## (0.003) (0.003) (0.003) (0.003)
## workers -3134.358*** -3089.363*** -3080.919***
## (447.373) (452.749) (454.386)
## industry: Business Products & Services/Advertising & Marketing -3119393.552 -3021343.377
## (2200078.743) (2220804.403)
## industry: Computer Hardware/Advertising & Marketing -2956015.132 -1562875.761
## (5822186.461) (5848354.253)
## industry: Construction/Advertising & Marketing -5316741.855 -5449915.348
## (2790909.334) (2812350.822)
## industry: Consumer Products & Services/Advertising & Marketing 4605429.609 4205288.581
## (2652069.212) (2672826.611)
## industry: Education/Advertising & Marketing -1322579.756 -1642664.045
## (4013726.247) (4033584.236)
## industry: Energy/Advertising & Marketing 9430816.841** 8727480.348*
## (3435529.337) (3490578.903)
## industry: Engineering/Advertising & Marketing -229146.377 -243445.134
## (4313868.518) (4348440.153)
## industry: Environmental Services/Advertising & Marketing -1925372.873 -1686596.075
## (4713620.455) (4758270.019)
## industry: Financial Services/Advertising & Marketing 1834675.721 1601768.627
## (2495622.481) (2517124.263)
## industry: Food & Beverage/Advertising & Marketing 2227445.858 2151869.828
## (3244201.932) (3261328.033)
## industry: Government Services/Advertising & Marketing 1009980.742 1069036.375
## (2724844.977) (2970302.187)
## industry: Health/Advertising & Marketing 980210.080 758142.041
## (2285744.618) (2309605.321)
## industry: Human Resources/Advertising & Marketing -40987.327 -256316.214
## (2806357.091) (2830929.000)
## industry: Insurance/Advertising & Marketing -1970000.651 -1841535.891
## (4287722.067) (4314119.050)
## industry: IT Services/Advertising & Marketing -1227918.923 -1174120.639
## (1957231.860) (1983913.330)
## industry: Logistics & Transportation/Advertising & Marketing -1653283.637 -1696519.910
## (3243380.467) (3279961.492)
## industry: Manufacturing/Advertising & Marketing -2845128.785 -2216298.539
## (2725135.668) (2760189.801)
## industry: Media/Advertising & Marketing -446152.119 -203801.322
## (4498253.851) (4521311.731)
## industry: Real Estate/Advertising & Marketing 3066524.290 2825411.406
## (3250742.304) (3273054.737)
## industry: Retail/Advertising & Marketing 333955.676 537008.862
## (2780591.476) (2797243.538)
## industry: Security/Advertising & Marketing 994491.090 229456.741
## (4269853.654) (4297055.677)
## industry: Software/Advertising & Marketing 6740.409 -147974.928
## (2353508.679) (2373931.684)
## industry: Telecommunications/Advertising & Marketing 1292244.152 872129.584
## (3214001.645) (3233572.572)
## industry: Travel & Hospitality/Advertising & Marketing -5997248.023 -6038258.490
## (4436137.037) (4489594.078)
## state_l: Alaska/Alabama -2607516.435
## (33113466.400)
## state_l: Arizona/Alabama -1886274.954
## (5495388.746)
## state_l: Arkansas/Alabama -2387133.685
## (11803045.516)
## state_l: California/Alabama 1830031.578
## (4626353.336)
## state_l: Colorado/Alabama 4963593.705
## (5343438.143)
## state_l: Connecticut/Alabama 1412658.825
## (6542115.319)
## state_l: Delaware/Alabama 720115.417
## (8930072.472)
## state_l: District of Columbia/Alabama 234794.250
## (6628397.086)
## state_l: Florida/Alabama 1132774.546
## (4829854.286)
## state_l: Georgia/Alabama 631087.384
## (4999628.870)
## state_l: Hawaii/Alabama 138302.593
## (15363414.166)
## state_l: Idaho/Alabama 2119383.852
## (10479894.224)
## state_l: Illinois/Alabama 824965.051
## (4942887.054)
## state_l: Indiana/Alabama 2784133.194
## (5883750.416)
## state_l: Iowa/Alabama -231412.095
## (7465662.381)
## state_l: Kansas/Alabama -518923.123
## (7044628.801)
## state_l: Kentucky/Alabama 1194550.328
## (7397216.024)
## state_l: Louisiana/Alabama 80548.161
## (6765981.146)
## state_l: Maine/Alabama 5609960.141
## (9836688.870)
## state_l: Maryland/Alabama 1058750.260
## (5318098.596)
## state_l: Massachusetts/Alabama -260999.279
## (5098812.725)
## state_l: Michigan/Alabama -41141.703
## (5311293.644)
## state_l: Minnesota/Alabama -5713901.278
## (5601582.586)
## state_l: Mississippi/Alabama -2529426.506
## (10485781.564)
## state_l: Missouri/Alabama -700946.038
## (5908223.114)
## state_l: Montana/Alabama -2855999.433
## (15366070.812)
## state_l: Nebraska/Alabama 1241909.562
## (7637652.213)
## state_l: Nevada/Alabama -1244580.322
## (7391679.863)
## state_l: New Hampshire/Alabama 3853167.050
## (8296964.814)
## state_l: New Jersey/Alabama 2443072.845
## (5138906.913)
## state_l: New Mexico/Alabama 767138.638
## (14112750.199)
## state_l: New York/Alabama 535380.615
## (4805534.735)
## state_l: North Carolina/Alabama 626474.932
## (5226074.431)
## state_l: North Dakota/Alabama -147291.799
## (14125189.830)
## state_l: Ohio/Alabama 575896.731
## (5110382.385)
## state_l: Oklahoma/Alabama -2379130.979
## (7476180.505)
## state_l: Oregon/Alabama 209199.022
## (6139400.096)
## state_l: Pennsylvania/Alabama 1051972.893
## (5049539.864)
## state_l: Puerto Rico/Alabama 2168871.361
## (23625913.352)
## state_l: Rhode Island/Alabama 1885067.291
## (9126225.490)
## state_l: South Carolina/Alabama 823112.477
## (6276706.842)
## state_l: South Dakota/Alabama -1503832.357
## (23633610.162)
## state_l: Tennessee/Alabama -3674856.239
## (5779448.024)
## state_l: Texas/Alabama 5993154.622
## (4745857.125)
## state_l: Utah/Alabama 2204590.821
## (5686504.560)
## state_l: Vermont/Alabama -1400313.358
## (19498491.091)
## state_l: Virginia/Alabama 1075027.639
## (4812382.520)
## state_l: Washington/Alabama 1491777.894
## (5473415.760)
## state_l: West Virginia/Alabama -2241460.914
## (17017182.657)
## state_l: Wisconsin/Alabama -16355500.032**
## (5942247.357)
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.897 0.898 0.899 0.899
## adj. R-squared 0.897 0.898 0.898 0.898
## sigma 32926060.992 32768803.256 32741283.113 32773391.395
## F 43450.196 21958.659 1693.215 578.653
## p 0.000 0.000 0.000 0.000
## Log-likelihood -93642.568 -93618.130 -93601.893 -93581.531
## Deviance 5418459211168380928.000 5365750950705011712.000 5331014325546490880.000 5287770588417689600.000
## AIC 187291.135 187244.259 187259.785 187319.061
## BIC 187310.687 187270.328 187442.267 187827.402
## N 5000 5000 5000 5000
## ===============================================================================================================================================================
I built a linear model using the variables that my analysis suggested had some effect on growth: revenue, workers, industry, and state. The model fits well, with an R-squared of about 0.9, so these variables account for roughly 90% of the variance in growth (measured in dollars). Revenue is by far the strongest predictor: on its own (m1) it already gives an R-squared of 0.897. Adding workers, industry, and state raises R-squared only to 0.899, and adjusted R-squared stays at 0.898 from m2 through m4. AIC and BIC are lowest for m2 and rise once industry and state are added, which suggests that state in particular has little impact on a company's growth once revenue is accounted for.
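The comparison of nested models described above can be sketched as follows. This is a minimal illustration on synthetic data, not the original analysis: the data frame `companies_demo`, its values, and the assumed relationship between revenue and growth are all made up so that the code runs on its own. Only the column names `growth_dollar`, `revenue2014`, `workers`, and `industry` follow the model formulas in the table.

```r
# Synthetic stand-in for the companies data frame; all values are made up
# purely so that this sketch is self-contained.
set.seed(42)
n <- 500
companies_demo <- data.frame(
  revenue2014 = rlnorm(n, meanlog = 16, sdlog = 1.2),
  workers     = rpois(n, lambda = 120),
  industry    = factor(sample(c("Energy", "Health", "Software"), n, replace = TRUE))
)
# ASSUMPTION: growth in dollars is roughly half of revenue plus noise.
companies_demo$growth_dollar <- 0.5 * companies_demo$revenue2014 +
  rnorm(n, sd = 2e6)

# Nested models mirroring m1 to m3: each adds one term to the previous model.
m1 <- lm(growth_dollar ~ revenue2014, data = companies_demo)
m2 <- update(m1, . ~ . + workers)
m3 <- update(m2, . ~ . + industry)

# Collect the fit statistics used to compare the models.
models <- list(m1, m2, m3)
fit_stats <- data.frame(
  model  = c("m1", "m2", "m3"),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC)
)
print(fit_stats)

# F-tests for whether each added term improves on the previous model.
print(anova(m1, m2, m3))
```

Adjusted R-squared, AIC, and BIC all penalize extra parameters, so they are more informative than raw R-squared when deciding whether a many-level factor such as state earns its place in the model.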
After a log transformation, the distributions of both growth and revenue appear approximately normal. Growth is closer to normal than revenue, which appears to be truncated on the left side. This truncation could be a result of the dataset containing only 5000 companies.
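To illustrate why a log transformation helps with right-skewed dollar amounts, here is a small sketch on simulated data. The vector `revenue_demo` and the helper `skewness()` are hypothetical and are not part of the original analysis.

```r
# Simulated right-skewed revenue values (hypothetical, log-normally distributed).
set.seed(1)
revenue_demo <- rlnorm(5000, meanlog = 16, sdlog = 1.3)

# Sample skewness: close to 0 for a symmetric distribution,
# strongly positive for a long right tail.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

raw_skew <- skewness(revenue_demo)
log_skew <- skewness(log10(revenue_demo))
cat("skewness of raw values:  ", round(raw_skew, 2), "\n")
cat("skewness of log10 values:", round(log_skew, 2), "\n")

# In an interactive session, a histogram shows the same thing visually.
if (interactive()) {
  hist(log10(revenue_demo), breaks = 50,
       main = "log10(revenue), simulated", xlab = "log10(revenue)")
}
```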
Companies in non-IT industries show greater revenue and growth than companies in the IT industries (IT Services, Software, and Computer Hardware). This trend is less pronounced at lower levels of revenue and growth, where IT and non-IT companies are more evenly matched. In fact, IT industries exceed non-IT industries in growth by revenue at around the $150-200 million mark.
But at the highest levels of revenue, around $1.2-2 billion, non-IT industries far outperform IT industries in terms of growth by revenue. This is likely due to the capital-intensive, large-scale operations involved in the healthcare, energy, and construction industries.
I decided not to apply a log10 transformation to the x and y axes for this plot because I was more interested in the upward trends of the smoothing function. I wanted to see where the line would go beyond $2 billion rather than focus on what lies under $250 million.
Growth per worker is a better indicator of a company's efficiency because it levels the playing field between large and small companies. The plot shows that companies with fewer employees tend to be more efficient: they generate less revenue but still reach the highest strata of growth. By comparison, larger companies with more employees and greater revenue generally achieve a lower level of growth. The trend is a continuum rather than a sharp divide: the spread from left to right shows that some smaller companies with fewer employees are less efficient than larger companies with more employees.
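The growth-per-worker measure can be sketched as a simple ratio. The four companies and their numbers below are invented for illustration; only the column names `growth_dollar` and `workers` follow the model formulas shown earlier.

```r
# Hypothetical companies of increasing size (values are made up).
companies_demo <- data.frame(
  company       = c("A", "B", "C", "D"),
  workers       = c(10, 50, 200, 1000),
  growth_dollar = c(4e6, 1e7, 2.5e7, 6e7)
)

# Growth per worker normalizes dollar growth by headcount.
companies_demo$growth_per_worker <-
  companies_demo$growth_dollar / companies_demo$workers

# Rank from most to least efficient.
ranked <- companies_demo[order(-companies_demo$growth_per_worker), ]
print(ranked)
```

In this made-up example the largest company has the highest growth in dollars but the lowest growth per worker, which is the pattern the plot describes.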
The companies dataset contains information on the 5000 fastest growing private companies in the U.S. in 2014. I began my analysis by computing descriptive statistics to understand what the variables in the dataset mean and how they are distributed.
I started with the dominant question: what affects company growth? Since this dataset is about the fastest growing private companies, I was looking for variables that might shed light on what increases or decreases growth. I looked at relationships between multiple features and eventually created a linear model using revenue, industry, and workers, which emerged from the plots and analysis as the most influential features affecting growth.
My conclusion is that revenue is the most robust feature in my dataset for predicting growth. It accounts for almost 90% of the variance in growth. The industry and number of employees of a company also have an effect, albeit a much smaller one than revenue.
The hardest part was understanding how different variables related to each other and how to tease apart their effects on growth. It was clear from the beginning that revenue has a direct effect on growth, since growth is calculated from the difference in revenue between the previous and current year.
However, I struggled to tease out how much other variables would affect growth and whether there was a meaningful relationship among other features, such as industry, state, and workers.
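The link between revenue and growth can be made concrete with a short sketch. It uses the first three `revenue` and `growth` values as printed in the `str()` output and assumes that `growth` is a percentage change relative to an earlier baseline revenue. That interpretation, and the names `revenue_base` and `growth_dollar` as computed here, are assumptions for illustration and may not match how the original analysis derived growth in dollars.

```r
# First three rows of revenue (dollars) and growth as printed in str().
demo <- data.frame(
  revenue = c(195640000, 82640563, 85076502),
  growth  = c(158957, 57348, 55460)
)

# ASSUMPTION: growth is a percentage change relative to a baseline revenue,
# so revenue = revenue_base * (1 + growth / 100).
demo$revenue_base  <- demo$revenue / (1 + demo$growth / 100)

# Growth in dollars is then the difference between current and baseline revenue.
demo$growth_dollar <- demo$revenue - demo$revenue_base
print(demo)
```

Because growth in dollars is built directly from revenue under this assumption, a strong linear relationship between the two is expected by construction rather than being an independent finding.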
I also found adding the population data helpful in the beginning. However, once it became clear that the state feature played a less meaningful role than other features, I realized state population was a dead end.
I found a lot of interesting results when creating multivariate plots, which helped clarify and drill into the relationship between revenue and growth. The plots that included industry and workers in the growth by revenue analysis offered more nuanced insight into what kinds of companies (for example, larger or smaller) in which industries had higher levels of growth.
I also spent a lot of time teasing apart the different perspectives on growth, for example growth measured in dollars, as a percentage, or per worker, to understand the different facets of this feature. This helped to inform later analysis.